Enhancing explainable SARS-CoV-2 vaccine development leveraging bee colony optimised Bi-LSTM, Bi-GRU models and bioinformatic analysis

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a single-stranded RNA virus that caused the outbreak of the coronavirus disease 2019 (COVID-19). The COVID-19 outbreak has led to millions of deaths and economic losses globally. Vaccination is the most practical solution, but finding epitopes (antigenic peptide regions) in the SARS-CoV-2 proteome is challenging, costly, and time-consuming. Here, we proposed a deep learning method based on standalone Recurrent Neural networks to predict epitopes from SARS-CoV-2 proteins easily. We optimised the standalone Bidirectional Long Short-Term Memory (Bi-LSTM) and Bidirectional Gated Recurrent Unit (Bi-GRU) with a bioinspired optimisation algorithm, namely, Bee Colony Optimization (BCO). The study shows that LSTM-based models, particularly BCO-Bi-LSTM, outperform all other models and achieve an accuracy of 0.92 and AUC of 0.944. To overcome the challenge of understanding the model predictions, explainable AI using the Shapely Additive Explanations (SHAP) method was employed to explain how Blackbox models make decisions. Finally, the predicted epitopes led to the development of a multi-epitope vaccine. The multi-epitope vaccine effectiveness evaluation is based on vaccine toxicity, allergic response risk, and antigenic and biochemical characteristics using bioinformatic tools. The developed multi-epitope vaccine is non-toxic and highly antigenic. Codon adaptation, cloning, gel electrophoresis assess genomic sequence, protein composition, expression and purification while docking and IMMSIM servers simulate interactions and immunological response, respectively. These investigations provide a conceptual framework for developing a SARS-CoV-2 vaccine.

identified the essential features for predicting epitopes and the role of all features using the Shapely Additive Explanations (SHAP) method.
Here are some novel research contributions for this study on epitope prediction for SARS-CoV-2 vaccines: 1.In the first stage of the study, Recurrent Neural Networks (Bidirectional Long Short-Term Memory (Bi-LSTM) and Bidirectional Gated Recurrent Unit (Bi-GRU)) were built from scratch. 2. Novel hybrid models (CNN-Bi-LSTM, CNN-Bi-GRU, BCO-Bi-LSTM, BCO-Bi-GRU, BCO-CNN-Bi-LSTM, BCO-CNN-Bi-GRU) were developed to improve the performance of the models and find the best-performing model.3. To increase the accuracy of epitope prediction, numerous additional information was used besides the SARS-CoV-2 proteins' sequence information.For instance, we integrated structural and chemical information to offer a more thorough picture of probable epitopes and boost prediction accuracy.4. To demonstrate the quality of the vaccine, we further examine its toxicity, potential antigenic properties, possibility of allergic reactions, and other biochemical parameters. 5.An adaptation of codons and cloning are also employed to examine the vaccine's genomic sequence and protein composition and guarantee its efficient expression.After the vaccine's 3D structure was generated with I-TASSER and galaxyWeb, docking was used to show how the vaccine interacts with its receptor.Finally, using the IMMSIM server, the immunological response anticipated from the vaccination was simulated.6.An explainable AI technique was employed to comprehend how the unique method's epitope predictions were created.Although deep learning methods have shown great promise in medical applications, they might be thought of as "black box" models because they do not reveal the process by which they make their predictions.Here, the SHAP technique was used to help researchers determine what attributes are most important for epitope classification.Explainable AI techniques have the potential to produce predictions that are simple to understand.

Datasets
The datasets used in this study were made publicly available and came from the Kaggle database.The SARS-CoV, B-cell, and SARS-CoV-2 datasets are all in the database (https:// www.kaggle.com/ datas ets/ futur ecorp orati on/ epito pe-predi ction).They have ten characteristics, with structural and chemical aspects comprising the data.Chou-Fasman (beta turn), Kolaskar-Tongaonkar (antigenicity), Parker (hydrophobicity), Emini (relative surface accessibility), Stability, isoelectric_point, Aromaticity, and Hydrophobicity are numerical.On the other hand, each protein sequence or peptide sequence will have a value corresponding to the number of their categorical letters.The sample of datasets can be visualised in Table 1.

Models
Bidirectional long short-term memory (Bi-LSTM) Recurrent neural networks of the type known as Bidirectional Long Short-Term Memory (Bi-LSTM) 40 can handle sequential data in both forward and backward orientations.The Bi-LSTM is primarily employed in studies involving speech recognition and natural language processing, where understanding each word's context is crucial.The Bi-LSTM model consists of two LSTM layers, one of which processes the input sequence forward and the other of which processes it backwards.The final output is created by concatenating the results of each layer.The forward and reverse states are calculated by the Bi-LSTM model using the following equations: a. Forward LSTM equations: (1) The forward and backward hidden states at each time step are concatenated to create the Bi-LSTM model's final output, which is then processed through a fully connected layer to provide the final prediction.
In conclusion, the Bi-LSTM model is a kind of recurrent neural network applied to sequential input's forward and backward processing.The model's two LSTM layers compute the forward and backward states.Based on the input sequence and the model's weights and biases, the equations employed in the Bi-LSTM model update the hidden and cell states of the LSTM layers.The forward and backward hidden states of the Bi-LSTM model are combined, and then they are sent through a fully connected layer to get the final output.

The bidirectional gated recurrent unit (Bi-GRU)
The Bidirectional Gated Recurrent Unit (Bi-GRU) 41 is a type of recurrent neural network (RNN) that can capture dependencies in both the forward and backward directions.It is commonly used in tasks that involve sequential data analysis, such as natural language processing and speech recognition.The Bi-GRU model consists of two GRU layers: one that processes the input sequence in the forward direction and another in the backward direction.The outputs of both layers are combined to form the final output of the Bi-GRU model.
where Reset Gate is r t , Update Gate z t , Candidate Activation h t , Hidden State h t .Similarly, the backward GRU layer has its own set of equations: where Reset Gate r' t , Update Gate z' t , Candidate Activation h t , Hidden State h' t .The forward and backwards hidden states at each time step, h t and h' t, respectively, are concatenated to create the Bi-GRU model's final output.This output can then be processed further, such as passing it through a fully connected layer, to obtain the final prediction.
(2) www.nature.com/scientificreports/In summary, the Bi-GRU model incorporates two GRU layers: forward processing and backward processing of the input sequence.The equations provided an update on the hidden states of the GRU layers based on the input sequence and the model's weights and biases.The forward and backward hidden states are combined to form the final output of the Bi-GRU model, enabling it to capture dependencies in both directions and make predictions based on the sequential data.

Bee colony optimization (BCO)
Several equations and algorithms are used in the Bee Colony Optimization algorithm to optimise the variables and configurations of a trained model.The fitness function, which assesses the quality of a solution based on the model's performance on a specific task or dataset, is one of the critical equations used in BCO 42 .For example, the fitness function can be adjusted to the particular issue to maximise precision or minimise inaccuracy.The exploration phase of the BCO algorithm, in which bees execute local searches by perturbing the model's parameters or hyperparameters, is another crucial algorithm.Mathematical operations such as addition, subtraction, multiplication, or mutation are frequently used to achieve this disturbance.Through a system comparable to a waggle dance, the bees exchange information about the quality of their solutions with one another.This information exchange makes the successful exploration and exploitation of various locations in the solution space possible.The best solutions are chosen during the exploitation phase based on their fitness values, and these solutions are used as the foundation for creating new solutions throughout the following iterations.Based on the problem domain and optimisation objectives, the particular equations and techniques employed in BCO can be tailored, resulting in practical parameter tuning and enhanced model performance.
1. Initialisation Assume that N is the population size and each bee represents a potential model solution with a set of hyperparameters or parameter configurations.2. Fitness Function A solution S' s quality is assessed by the fitness function, denoted by the symbol f S.

Local Search (Exploration Phase)
Local search involves perturbing the model's parameters or hyperparameters.Assume P is a perturbation function and S is the solution. 4. S′ = S + P , where P is a random value within a given range, is the parameter perturbation.5. S′ = S * P , where P is a random number within a given range, is the hyperparameter perturbation.
Information Sharing: Bees talk to one another by exchanging details on the calibre of their solutions.S i and f Si should stand in for the solution and its fitness value for the i th bee.
They are choosing the Best Solutions and Generating New Solutions (Exploitation Phase).During exploitation, the best solutions are chosen, and fresh solutions are created.In Roulette Wheel Selection, a solution's f Si fitness value concerning the fitness sum as a whole determines the likelihood of selecting it.Using crossover operators, new solutions are created by mixing the parameters of previously chosen solutions.Random changes are made to the parameters of chosen solutions to explore new areas of the solution space.

Training
Bidirectional Long Short-Term Memory (Bi-LSTM) and Bidirectional Gated Recurrent Unit (Bi-GRU) models of recurrent neural networks (RNNs) 40,43 were used in this study to classify protein sequences on datasets related to B-cells and SARS-CoV.The training procedure for each dataset was split into four steps, each integrating different techniques to improve the performance of the RNN models.
It can be challenging to detect long-term dependencies because of the vanishing gradient problem, in which gradients get smaller and smaller throughout backpropagation.The Bi-LSTM and Bi-GRU models were individually trained in the first step.These models' internal memory cells enable them to detect long-range relationships in sequential data.In the second stage, a hybrid approach combining Bi-LSTM and Bi-GRU with a method called Bee Colony Optimization (BCO) was used.BCO is an optimisation algorithm inspired by the behaviours of bees, as stated.It helps to fine-tune the parameters of the RNN models, improving their performance.Each standalone RNN model was optimised using BCO and trained separately (see Fig. 1).
The third step involved the addition of a Convolutional Neural Network (CNN) 44 together with Bi-LSTM and Bi-GRU.Deep, feed-forward neural networks called CNNs can learn hierarchical spatial representations without the aid of manually created feature sets.Convolutional layers were used to extract pertinent local attributes from the input data using CNNs applied to each Bi-LSTM and Bi-GRU model.In the last phase, Bi-LSTM and Bi-GRU connected with BCO-CNN, previously paired with CNN (BCO-CNN).This strategy aimed to improve each of the models' performance further by combining the benefits of BCO and CNN approaches.
After the input layer, an embedding layer was used to handle the input features effectively.The parameters of the embedding layer are carefully chosen to minimise training error, and it maps the input features to a higherdimensional space.Several strategies were used during the training to enhance the model's generalizability and avoid overfitting.The vanishing gradient problem was addressed by adding batch normalisation layers before the CNN layer, which speeds up training.Early stopping with a patience of 3 was used to end training when the accuracy or loss stopped increasing, and a dropout layer was used to prevent overfitting.
The study aimed at epitopes that might be used in vaccine development using the SARS-CoV-2 datasets.The necessary bioinformatics tools were used to analyse the identified epitopes further.ToxinPred 45 for toxicity verification, AllerTOP2.0 46for potential allergens in the predicted epitopes and VaxiJen 47 for recognising antigens from the predicted epitopes.The proposed method and the overall workflow of the study can be visualised in Fig. 2,  www.nature.com/scientificreports/

SHAP approach
The deep black box model's predictions are explained using the SHAP approach, also known as Shapely Additive Explanations.The average percentage of each feature's contribution to the prediction value is mathematically known as the shapely value for that feature.This may be calculated by comparing the prediction value with and without the feature in each scene where it might interact with other features, then taking the average of those contributions as the SHAP value for that feature.The process is repeated for all other features to find their respective contributions using the Shapley kernel 48,49 .The analysis made use of the SARS-CoV-2 dataset.The contribution of every feature in the peptide sequences identified as epitopes or not was examined.With this approach, the model must be trained on every feature subset S ⊆ F , where F is the collection of complete features.Every feature is given a significance value that reflects the impact of incorporating that component on the model's prediction.Two models, f S∪{i} and f S are trained with the feature present and hidden, respectively, and are used to calculate the effect.Then, based on the current input f S∪{i} (x S∪{i} ) − fS(xS) , the projected outcomes from the two models are compared, in which x S represent the values of the input features in the set S. The differences mentioned above are calculated for all potential subsets S ⊆ F\{i} since the effect of omitting a feature relies on other characteristics in the model.Once calculated, the Shapley values are applied as feature contributions.The weighted average of all potential differences makes up these outcomes 48 .

Investigation of allergenicity and antigenicity
Using the VaxiJen 2.0 server, the antigenicity of the finished vaccine will be evaluated (http:// www.ddg-pharm fac.net/ vaxij en/ VaxiJ en/ VaxiJ en.html) 47 .VaxiJen is the first service to forecast tumour, viral, and bacterial protective antigens without considering alignment.The models on the server were produced by employing autocovariance (ACC) to pre-process amino acid properties.To evaluate allergens from the anticipated epitopes, the AllerTOP2.0server will be used 46 .The server was created to provide alignment-independent models for allergen identification based on the fundamental chemical characteristics of the sequences of amino acids (https:// www.ddg-pharm fac.net/ Aller TOP/).

Evaluation of toxicity and biochemical characteristics
Support vector machine (SVM) technology was used to construct the ToxinPred 45 tool, which was used for the evaluation, and a score of 0.0 is considered non-toxic.To identify epitopes as poisonous or not, the ToxinPred score considers physiochemical characteristics of the epitope, such as molecular weight, hydrophilic nature, and potential mutations (https:// webs.iiitd.edu.in/ ragha va/ toxin pred/).The physicochemical properties of the final vaccination will be predicted using the ExPASy ProtParam server 50 .The physicochemical aspects include the half-life and instability index (https:// web.expasy.org/ protp aram/).

Multi-epitope BLAST screening
A BLAST was used to assess protein similarity to human proteins and lower the likelihood of autoimmunity.The UniProtKB Human database received the vaccine sequence for the blast investigation (https:// www.ebi.ac.uk/ Tools/ sss/ ncbib last/).Moreover, with the use of the Pipeline Builder for Identification of Drug Targets (PBIT) (http:// www.pbit.bicni rrh.res.in), we submitted the vaccine protein against the proteome of frequently occurring microbes of the gut considering the role the microorganisms play in safeguarding health.Proteins with an e-value threshold greater than 0.005 and less than 50% of a given sequence shared with the intestinal microbiome proteome were deemed non-homologous 51 .The final vaccine consists of a 50S ribosomal protein L2 52 adjuvant for improving antigenicity (accession no.AXI95322.1)joined to the amino (N) terminus of the multi-subunit sequence by an EAAAK linker 53 .GPGPG linkers link ten B-cell epitope subunits together.A 6xHis tag is added at the C-terminal to facilitate protein purification and identification 54 .

Prediction of secondary structure and solubility
The PSIPRED online programme (http:// bioinf.cs.ucl.ac.uk/ psipr ed/) was used to generate the secondary structures of the vaccine structure.It is an online server secondary structure generation tool that uses two feed-forward neural networks to predict protein structure.Furthermore, it effectively predicts transmembrane helix, fold, transmembrane topology, and domain recognition, among other things 55 .Additionally, the Protein-Sol server (https:// prote in-sol.manch ester.ac.uk) employed a population average (PopAvrSol) of 0.45 to assess the solubility of a multi-epitope vaccine, with values greater than 0.45, suggesting improved solubility.The predicted scaled solubility value (QuerySol) of the protein will determine how soluble it is 56 .

Predicting tertiary structures
The vaccine's tertiary or three-dimensional (3D) model was made using the homology modelling program I-TASSER (Iterative Threading Assembly Refinement) platform (https:// seq2f un.dcmb.med.umich.edu//I-TASSER/).It is an integrated platform for computational protein structure and function prediction based on the sequence, structure, and function approach.It leverages the Protein Data Bank (PDB) to find similar structural patterns 57 .

Tertiary structure refinement
Using the GalaxyRefine web server (http:// galaxy.seokl ab.org/ cgi-bin/ submit.cgi? type= REFINE), the vaccine peptide's 3D model will be enhanced.Based on refining techniques that were successfully tested in CASP10based refinement studies, the GalaxyRefine server was created, and the structure's relaxation was accomplished by repacking and molecular dynamics modelling.When applied to modern protein structure prediction models, this method can improve the overall standard of local and global structures 58 .The Molprobity score, GDT-HA score, RMSD score, and Clash score are used to assess the quality of the revised model.

Immune receptors and the vaccine's docking
A widespread tool for protein-protein docking, ClusPro (https:// clusp ro.bu.edu) was utilised for molecular docking.After docking with each combination of energy parameters, ten models are generated and clustered around populations of low-energy docked structures 59 .TLR4 with PDB ID: 2Z63 is the immunological receptor of choice.The docked unit with the lowest energy is picked among the ten ratings.PyMOL was used to visualise the 3D structure of the key interacting residues.

Evaluation of codon adaptation and cloning
JCat (http:// www.jcat.de) is an innovative approach for increasing protein output by identifying and optimising the codons of the target gene to adapt to different sequenced prokaryotes and particular hosts for eukaryotic gene expression.The process of optimising a particular sequence is split into two steps by JCat.First, the sequence is checked to determine if it matches a recognised amino acid or gene sequence.A series of amino acids is then translated from it.The codons with the highest relative adaptiveness for the pertinent amino acid for expression in the host are used to turn the amino acid sequence into a gene sequence in a subsequent step 60 .Next, the vaccine's codons will be cloned into the pRSFDuet-1 vector using the Snapgene application to achieve the most significant possible expression within E. coli (available at (https:// www.snapg ene.com/ free-trial/ ).

Polymerase chain reaction with agarose gel electrophoresis simulation
Using SnapGene, the primers were designed according to the Tm value and the length.The typical primer length is 20-23 bp, the Tm value is selected at 55-65 °C, the annealing temperatures are 1 °C, the GC contents are around 50-65%, and a protective nucleobase is inserted at the 5′ end.Lastly, using SnapGene, the recombinant plasmid was used to simulate agarose gel electrophoresis (https:// www.snapg ene.com/ free-trial/).

Immune response to vaccine simulation
The vaccine immune response profile will be made available through the C-ImmSim internet simulation service.
The vaccination triggers a comparable immune reaction when it enters the body as an antigen.C-ImmSim was utilised to replicate the magnitude and nature of immunological reactions triggered by the MEV in people.C-ImmSim will analyse a mammalian immune system's humoral and cellular reaction after the vaccine's initial booster dose.(https:// kraken.iac.rm.cnr.it/C-IMMSIM/ index.php? page=1) 61 .

Results and discussion
The performance of several models on the SARS dataset is presented in Table 1 and is assessed in terms of accuracy and AUC (Area Under the ROC Curve).The Bi-LSTM model's accuracy and AUC were 0.8462 and 0.8545, respectively.With the SARS dataset, the Bi-LSTM does an excellent job of accurately classifying the epitopes.The long-term dependencies in sequential data are well-represented by the LSTM architecture.The Bi-GRU model achieved an accuracy of 0.77 and an AUC of 0.894, even though it is noticeably less accurate than the Bi-LSTM model.Applying Bee Colony Optimization improves the BCO-Bi-LSTM model's performance with an AUC of 0.944 and an accuracy of 0.92.The BCO-Bi-LSTM model shows improved performance.It implies that the optimisation process helps in identifying optimal model configurations.The BCO-Bi-GRU model has an accuracy and an AUC of 0.8846.It works much less than BCO-Bi-LSTM, although it produces better results than the basic Bi-GRU model.It demonstrates that BCO optimisation can be favourable for GRU and LSTM architectures.
The BCO-CNN-Bi-LSTM and BCO-CNN-BI-GRU models had accuracy scores of 0.8 and 0.826, respectively, and AUC scores of 0.839 and 0.8864, respectively.See Table 2. Combining BCO and Convolutional Neural Networks (CNN) with either Bi-LSTM or Bi-GRU, these models provide competitive performance.With CNN layers added, the models can identify regional patterns and extract meaningful information from the input data.
The CNN-Bi-GRU model has an accuracy and an AUC of 0.817.Its performance is moderate when compared to the other models in the table.This implies that the combination of CNN and GRU may not be as effective for this dataset.Further research might be needed to improve performance, or the model might need to be modified.The accuracy and AUC of the CNN-Bi-LSTM model are 0.8654 and 0.8854, respectively.It performs admirably, much like the Bi-LSTM model.Accurate predictions are made due to the effective capture of spatial and temporal dependencies in the data by the CNN and LSTM combo.
The results show how effectively LSTM-based models, specifically Bi-LSTM and BCO-Bi-LSTM, classified the SARS dataset.CNN layers and BCO tuning improve performance even more.The performance of Bi-LSTMbased models is slightly superior to that of Bi-GRU models.These findings provide helpful direction for choosing the most appropriate models for SARS epitope prediction-related tasks.The outcomes from both tables show how well LSTM-based models, particularly LSTM, capture sequential patterns in the Bcell and SARS datasets.Additional layers like CNN and optimisation methods like Bee Colony Optimization can be added to increase performance even more.However, the lower accuracy and AUC scores suggest that the Bcell dataset is more www.nature.com/scientificreports/challenging to capture.This implies that in order to get better results on the Bcell dataset, more investigation and model tuning could be necessary, as the results show in Table 3.

An explainable deep model with SHAP
The deep models here may be thought of as "black boxes" because peptides are fed into the models, and predictions about whether or not they are epitopes are retrieved from the last layer without any explanation of the decision-making process.Shapley values were computed for each feature to understand each feature's role in forecasting.Determining the average value contributed by every feature will assist in explaining why the black box model produced such forecasts.Lower magnitude numbers or negative values indicate less relevance in the forecast, and the more positive the SHAP value, the more significant it is in making the prediction 48 .Therefore, we learned that peptide sequence is crucial in the prediction task.This is very important since each peptide sequence has its component amino acids.Next is the isoelectric point, chou_fasman (beta turn), protein sequence, and relative surface accessibility or emini see Fig. 3.
The prediction task's most crucial element was determined to be the peptide sequence.This suggests that whether a peptide is categorised as an epitope or not is strongly influenced by the exact arrangement of amino acids inside it.The model probably picked up peptide sequence motifs or patterns pointing to epitopes.As a result, different amino acid combinations or variants play a substantial role in the model's choice-making process.Identifying epitopes also requires consideration of the isoelectric point.The pH level where the peptides' net charges are zero is known as the isoelectric point.According to the feature's positive SHAP value, epitope prediction may be positively influenced by particular isoelectric point ranges or values.It suggests that the peptide's charge distribution affects the strength of its ability to impact an epitope.Next, an important characteristic is the Chou_Fasman score, notably the Beta Turn conformation.The fact that this feature has a positive SHAP value shows that specific motifs or traits connected to beta turn conformations are predictive of epitopes.The model probably discovered that secondary structural patterns or particular combinations of amino acids are connected to beta turns and help predict epitopes.In addition to the peptide, the protein sequence also plays a role in the prediction process.It captures the larger context of the protein the peptide belongs to, even if it can contain information that matches the peptide sequence.The model could have picked up on certain features or patterns in the protein sequence that help predict epitopes.Another essential element that Emini refers to is thought to be relative surface accessibility.The information this characteristic provides describes the amino acids exposed to or readily available on the protein's surface.The fact that this characteristic has a positive SHAP value shows that the prediction of epitopes may be affected by specific accessibility rates or combinations of exposed amino acids.The model probably discovered that, in contrast to non-epitopes, epitopes frequently have distinctive surface properties.
In general, the interpretation of the SHAP data emphasises the crucial elements that help the model forecast epitopes.Lower magnitude or negative SHAP values imply less relevance, whereas positive SHAP values highlight the significance of particular feature values or patterns in the forecast.By providing insights into the underlying mechanisms influencing the predictions, this knowledge aids in understanding how and why the black box model generated its predictions.

Allergenicity and antigenicity assessment
The Vaxijen 2.0 internet server grades the resulting multi-epitope vaccine sequence's antigenicity.We assess each vaccine component, including the adjuvant, for its antigenicity (see Table 4).With a threshold of 0.4, the peptide is said to be an antigen, according to the Vaxijen tool.The Vaxijen score for the entire final vaccine is 0.8772, indicating that our final vaccine has a high level of antigenicity.ANTIGENpro was further utilised to verify the predictions made by the Vaxijen tool (see Table 4), and it confirmed the vaccine's antigenicity with a 0.8959 prediction score.AllerTOP 2.0 server forecast that the finished vaccine and each of its parts and adjuvant will be allergy-free (see Table 5).www.nature.com/scientificreports/

Toxicity and biochemical characteristics
Vaccine safety is a priority.Therefore, it must not have a high potential for toxicity, and its physicochemical characteristics must also be considered when assessing its interactions with biological systems 62 .The toxinPred server is used to forecast toxicity.The findings of our examinations of each subunit's physicochemical and toxicological characteristics are displayed in Table 6.Each subunit was found to be safe.Therefore, there are no hazardous component peptides in either vaccine.ExPASy ProtParam Tool also predicts molecular weight, hydropathicity, charge, half-life, instability index, and pI (theoretical isoelectric point value).The completed vaccine's estimated hydropathicity value is − 0.812.This low number suggests that our final vaccine will be hydrophilic and efficiently bind to water molecules 17 .The finalised vaccine's half-life is anticipated at 30 in vitro hours, while for in vivo, it is more than 20 h.Our final vaccine is stable since the expected Instability Index is 29.69, which is below the cut-off of 40.With a calculated pI of 9.80, the finished vaccine is highly basic and alkaline.The molecular mass of the finished vaccine is found to be 54880.72Da.

Homology analysis and the assembling of a finalised multi-epitope vaccine candidate
We conduct a BLAST search on all 11 vaccine subunits utilising the Uniprot database to rule out probable autoimmunity.A subunit with more than 35% identity with the human proteome will be regarded as a homologous protein.None of the 11 vaccine components we ultimately decided to use in the vaccine production exhibits significant similarity with the human proteome.See Table 7 for results.The 50S ribosomal protein L7/L12 showed significant identity with the multi-epitope vaccine compared to the gut microbiota during the search for similarities, while other components did not show any homology.To improve the immune response, the finished vaccine contains an adjuvant, 50S ribosomal protein L2 (accession no.AXI95322.1),which is joined to the amino (N) terminum of the multi-subunit sequence utilising the EAAAK linker.There are 509 amino acid residues in the completed vaccine.10 B-cell epitope subunits are fused via GPGPG linkers 53 see Fig. 4. To aid in the purification and identification of the protein, eventually, the C-terminal is tagged with a 6xHis tag 54 .

Secondary structure and solubility prediction
The PSIPRED server was used to predict the secondary structure of the vaccine, which has 8.34% alpha helices, 18% beta strands, and 73.6% coils (see Fig. 5A).Finally, the vaccine was predicted to be soluble by the Sol-Pro  www.nature.com/scientificreports/tool.According to Fig. 5B, the estimated scale solubility value of 0.543 indicated good solubility because it was higher than the population average (PopAvrSol) of 0.45.

A favoured vaccine's multi-epitope 3D structure
The I-TASSER server begins modelling with structure templates located in the PDB database.Despite the server's ability to generate thousands of template alignments, it only uses the best ones based on accuracy where Z-score > 1 = good alignment.The five likely tertiary structures associated with the multi-epitope vaccine were predicted using the top 10 templates, having Z-scores varying from 1.78 to 6.21.Each of these five models' unique  www.nature.com/scientificreports/

Adaptation of codons and cloning
Using the Java Codon Adaptation Tool, we assessed the expression efficiency and the codon utilisation of the vaccine design for cloning in E. coli strain K12.A total of 1412 nucleotides make up the optimised codon sequence.It has a Codon Adaptation Index (CAI) of 1.0, which is within (0.8-1.0), and an average GC content of 52.46%, which is within the ideal value (30-70%), both of which indicate a high possibility that the final vaccine would be adequately produced in the E. coli host.Using the SnapGene tool, we introduced the codon sequences into the pRSFDuet-1 vector see Fig. 8.It is placed between PciI (1782) and BstEII (3I94) locations in the vector.The final vaccine's codon sequence is in red, and the pRSFDuet-1 expression vector is in black.

Simulation of agarose gel electrophoresis and polymerase chain reaction
A gel electrophoresis simulation was carried out to confirm whether the multi-epitope vaccine will be purified.
The forward primer (5′-CAT GTC TGG TTC TGA CTT CG-3′) has a length of 20, a Tm value of 55 °C, and a GC content of 50%.The reverse primer (5′-CCG GAC CCG GAC CTT TAC GGAAC-3′) has a length of 23, a Tm value of 65 0 C, and a GC content of 65%.These primers were created based on the parameters above.SnapGene was used to amplify the MEV's target gene.We used a concentration of 1% for the simulated agarose electrophoresis, and we chose TBE based on its superior capacity to stabilise the target gene, vector, and recombinant plasmid in a buffer solution.Ultimately, the quantity of DNA matched earlier estimates.In the end, the amount of DNA matched with previous estimates.The amplified MEV sequence was 1412, and the cloned pRSFDuet-1 plasmid with MEVsequence was 3871 bp.After digestion of pRSFDuet-1 plasmid with PciI and BstEII enzymes, 2459 bp and 1412 bp sequences were obtained (Fig. 9).

Simulation of immune response to vaccine
Through the C-ImmSim server, our vaccine design significantly simulated the mammalian immune response (Fig. 10 A-E).The results also showed that our multi-epitopic vaccine produced a healthy number of B cells that secrete antibodies, with the IgM and IgG (Fig. 10A) subclasses being the most noticeable.The profile of cytokines (Fig. 10B) generated following the injections was one noteworthy finding.A dramatic surge of pro-inflammatory IFN-g and IL-2 after repeated vaccination was seen.Also, some anti-inflammatory cytokines, such as TGF-b and IL-6, showed peaks.Plasma cell (Fig. 10C) and T-cell (Fig. 10E) production were reasonably large, and memory cells (Fig. 10D) remained viable for several months.Therefore, our findings show that immune reactions gradually increased after a booster vaccination dose and lasted for several months.

Conclusion
This study suggested a deep learning framework built on a biological optimisation algorithm for choosing epitopes that may be used in vaccine development.The DL-based algorithms have demonstrated exceptional performance for the prediction job and may thus be used to quickly and cheaply choose the appropriate epitopes to develop a SARS-CoV-2 vaccine.A multi-epitope vaccine against COVID-19 was created using the epitopes predicted by the model.The finalised vaccine's toxicity, potential allergic reactions, and other physiochemical characteristics were examined and determined to be safe.Additionally, it has high antigenicity, which is crucial for

Figure 1 .
Figure 1.Proposed model designs.First, B-cell and SARS datasets were used to train the models in all four steps.Then, the best model was selected for the prediction of epitopes on the SARS-CoV-2 dataset.

Figure 3 .
Figure 3. SHAP values plot for feature importance.The SHAP value's magnitude reveals how significant a feature is to the model's prediction.A more significant absolute SHAP value indicates a higher relevance.A positive SHAP value and vice versa show the contribution of a feature to a more excellent prediction value.

Figure 4 .
Figure 4. Multi-epitope vaccine schematic diagram.The multi-epitope vaccine sequence, which is 509 amino acids long, links an adjuvant (green) at the N-terminal end with the multi-epitope sequence using an EAAAK linker (purple).GPGPG linkers (yellow) were used to link ten epitopes at the tail, a poly His tag was lastly inserted.

Figure 5 .
Figure 5. Predictions about the secondary structure of the vaccine construct and on solubility analysis.(A) The secondary structure prediction indicated that the likelihood of the protein forming antigenic epitopes is indicated by the high percentage of β strands 18% and random coil 73% seen in the MEV (B) ProtSol predicted the solubility value of 0.543.

Figure 7 .
Figure 7. (A) Molecular docking depicts a ligand-receptor complex with the TLR4 (PDB ID: 2Z63) receptor and the multi-epitope vaccine as the ligand.The cyan colour represents the vaccine, and the brown represents the receptor.(B) Using the visualising programme PyMol, the interactions between residues in the MEV and TLR4 complex were examined, and their 3D image was captured.

Figure 10 .
Figure 10.Utilising C-ImmSim to predict the immune simulation response to the vaccination design.C-ImmSim software conducted an in silico immune simulation research in response to delivering two doses of the vaccine design spaced five months apart.

Table 1 .
Sample of the datasets., f t , and o t are the input, forget, and output gates, and W and b are the weights and biases of the LSTM layer.x t is the input sequence at time t.h t and c t are the hidden state and cell state at time t.
b. Backward LSTM equations: where x' t is the input sequence at time t in the backward direction, h' t and c' t are the hidden state and cell state at time t in the backward direction, i' t , f ' t , and o' t are the input, forget, and output gates, and W' and b' are the weights and biases of the backward LSTM layer.

Table 4 .
Predictions of Vaxijen and ANTIGENpro tools for antigenicity.

Table 6 .
ToxinPred results for toxicity analysis.

Table 7 .
BLAST screening results against UniProtKB Human database.